This research project aimed to identify the factors associated with depression and poor mental health rates in the United States. The study analyzed data from multiple sources, including the Behavioral Risk Factor Surveillance System, the American Community Survey, and the COVID-19 Community Profile Report. We used the data to build multiple linear regression models and regression trees, and we conducted stepwise selection to identify the optimal combination of features. The study found that demographic, socioeconomic, and health-related variables significantly predict depression and poor mental health rates. The models can serve as a basis for further research on this topic and can help policymakers identify high-risk populations and design targeted interventions. However, the study has limitations, including possible measurement bias, limited geographical coverage, and potential confounding variables that were not accounted for. Further research could explore additional factors related to depression and use qualitative methods to better understand the subjective experiences of individuals living with depression. Overall, this study provides insights into the factors associated with depression and poor mental health rates in the United States and can inform future research and policy decisions.
The following chunk clears the workspace and loads the libraries needed for this project.
rm(list=ls())
library(readr)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(corrplot)
library(ezids)
library(car)
library(rpart)
library(rpart.plot)
library(rattle)
library(tree)
url <- 'https://raw.githubusercontent.com/eitanaka/DATS6101_Final_Project_Team2/main/data_set/geo_socio_health_df.csv'
master_df <- read_csv(url)
# Drop the first column
master_df <- master_df[, -1]
# Rename some columns to make them more readable and easier to work with.
# Each line uses colnames() to find the column that matches the original name and assigns the new name.
colnames(master_df)[colnames(master_df) == "MT_Never Married"] <- "mt.nev.mar"
colnames(master_df)[colnames(master_df) == "MT_Now married"] <- "mt.now.mar"
colnames(master_df)[colnames(master_df) == "Total Population"] <- "tot.pop"
colnames(master_df)[colnames(master_df) == "EA_Less than high school graduate"] <- "ea.less.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_High school graduate"] <- "ea.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_college or associate's degree"] <- "ea.col.ass.deg"
colnames(master_df)[colnames(master_df) == "EA_Bachelor's degree"] <- "ea.ba.deg"
colnames(master_df)[colnames(master_df) == "EA_Graduate or professional degree"] <- "ea.grad.prof.deg"
# Create a new data frame that includes only the numeric variables
numeric_vars <- sapply(master_df, is.numeric)
num_df <- master_df[, numeric_vars]
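The renaming and numeric-column selection above can also be expressed with a single lookup vector. The sketch below is only an illustration on a made-up two-row data frame; `toy_df` and `rename_map` are hypothetical names, and the values are not taken from the project data.

```r
# Toy stand-in for master_df (values are invented for illustration)
toy_df <- data.frame(
  "MT_Never Married" = c(23.7, 39.5),
  "Total Population" = c(1941, 1757),
  StateAbbr = c("AL", "AL"),
  check.names = FALSE
)

# Old name -> new name lookup (hypothetical helper, same names as in the report)
rename_map <- c(
  "MT_Never Married" = "mt.nev.mar",
  "Total Population" = "tot.pop"
)

# Rename only the columns that appear in the lookup
hits <- names(toy_df) %in% names(rename_map)
names(toy_df)[hits] <- unname(rename_map[names(toy_df)[hits]])

# Keep only the numeric columns
toy_num <- toy_df[, sapply(toy_df, is.numeric), drop = FALSE]
print(names(toy_num))
```

Keeping the mapping in one vector makes it easier to audit which columns were renamed, at the cost of being slightly less explicit than one assignment per column.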
Using a range of datasets, our previous project focused on the relationship between health risk behaviors (lack of physical activity and sleep) and health outcomes (depression) and status (poor mental health). In this research project, we investigate the influence of other health conditions, socioeconomic factors, and the economic impact of COVID-19 on mental health in the United States. We aim to determine which factors significantly affected mental health and well-being in the US during 2020. To do so, we analyze four different datasets at the census tract level, using statistical techniques such as correlation, linear regression, and decision tree analysis. The results of this study have important implications for improving mental health outcomes and shaping public health policies and interventions. By identifying the factors that most significantly affect mental health, we can develop targeted interventions and programs to address these issues.
Our research project uses four different datasets to investigate the influence of health conditions, socioeconomic factors, and the economic impact of COVID-19 on mental health outcomes in the United States. One of these datasets comes from our previous project (Project 1), while the others were collected specifically for this study.
The CDC_PLACE dataset measures health outcomes, prevention, health risk behaviors, and health status. It contains 13 health outcomes, nine prevention measures, four health risk behaviors, and three health status measures. The dataset was launched by the Centers for Disease Control and Prevention (CDC). In 2020, it provided small-area estimates for counties, places, census tracts, and ZIP Code Tabulation Areas across the United States. Each measure has a comprehensive definition that includes the background, significance, and limitations of the indicator, as well as the data source and the limitations of the data resources.
The American Community Survey 2020 is another dataset used in our study. This ongoing survey provides detailed information on the population and housing in the US yearly. The ACS helps local officials, community leaders, and businesses understand the changes in their communities. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.
Planning Database 2020 contains operational, demographic, and socioeconomic statistics from the 2010 Census at both the tract and block group levels.
County Economic Impact Index (CEII) 2020 estimates the overall county-level economic activity change during the COVID-19 pandemic relative to 2020. CEII data also includes annualized monthly estimates of county-level value added for over 100 industries. Counties whose economic activity is dominated by industries experiencing rising unemployment can expect larger direct impacts on their local economies, notably if those industries account for a large portion of the county's economic output. These four datasets and the six others were analyzed to investigate the relationship between mental health outcomes and various socioeconomic and lifestyle factors, as well as the impact of COVID-19.
Our guiding questions for this project are the following:
SQ1: Which factors do we observe to be highly associated with depression and poor mental health rates?
SQ2: Using those factors, how accurately can we infer depression and poor mental health rates in a given tract?
Data cleaning and preparation is a crucial step in any data analysis project. In this step, we ensure that the data is in a format suitable for analysis and that any inconsistencies, errors, or missing values are corrected or removed. We merged the four data sources by CountyFIPS and the tract-level geographic ID (GEOID), which allowed us to combine information from different sources based on their shared identifiers. We also renamed variables to make them more meaningful and easier to understand. After merging and renaming, we removed all rows containing null values, because null values could interfere with our analysis and lead to inaccurate conclusions. We also eliminated all outliers to reduce the influence of extreme values on our analysis. Once we completed these cleaning steps, we had a data frame with 12,444 observations of 57 variables, ready for further analysis and exploration.
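The merge itself is not shown in this report, so the following is only a minimal sketch of how tract-level and county-level tables could be joined on their shared identifiers with base R `merge()`. The three toy tables, their column subsets, and all values are assumptions made for illustration.

```r
# Toy tract-level health table (invented values)
places <- data.frame(
  GEOID = c("01001020100", "01001020200", "01003010100"),
  CountyFIPS = c("01001", "01001", "01003"),
  DEPRESSION = c(25.9, 23.9, 24.0),
  stringsAsFactors = FALSE
)

# Toy tract-level socioeconomic table; one tract is missing on purpose
acs <- data.frame(
  GEOID = c("01001020100", "01001020200"),
  poverty = c(11.34, 17.88),
  stringsAsFactors = FALSE
)

# Toy county-level economic index table
ceii <- data.frame(
  CountyFIPS = c("01001", "01003"),
  index_apr20 = c(0.913, 0.900),
  stringsAsFactors = FALSE
)

# Inner join on the tract ID, then attach the county-level index by county FIPS
tract_df <- merge(places, acs, by = "GEOID")
merged <- merge(tract_df, ceii, by = "CountyFIPS")
print(merged)
```

An inner join silently drops tracts that are missing from any source, so comparing row counts before and after each merge is a cheap way to see how much data the join removes.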
In this section, we perform EDA on a dataset containing health and socioeconomic factors for different tracts in the United States. The purpose of the analysis is to investigate the relationship between these factors and two health outcomes: depression rate and poor mental health rate. We check the data quality, including missing values, outliers, and inconsistencies. We then examine the distributions of the variables and their relationships using visualizations such as histograms, scatterplots, and correlation matrices. The EDA reveals that some variables are highly correlated with depression rates and poor mental health rates. Overall, the EDA helps us to understand the data, identify potential issues, and guide our subsequent analysis. The insights gained from this analysis have important implications for public health policy and model building.
The dataset is a table with 60,962 rows and 57 columns. The columns represent various variables, including health indicators, demographic characteristics, and economic indicators. The data types of the variables are numeric, except for CountyFIPS, GEOID, StateAbbr, and CountyName. The dataset has also omitted some rows due to missing data, as indicated by the "na.action" attribute. The dataset is a comprehensive collection of health and demographic information for a large population.
str(num_df)
## tibble [60,962 × 57] (S3: tbl_df/tbl/data.frame)
## $ ACCESS2 : num [1:60962] 15 19.9 12.5 18 19.6 16.7 16.5 21.1 12.7 17.8 ...
## $ BINGE : num [1:60962] 15.8 14.1 15.2 14.9 14.9 15.5 14.9 12.5 16.6 14.9 ...
## $ CHECKUP : num [1:60962] 74.1 76.5 75.5 74.6 74 74.3 75.2 78.3 73.9 75.5 ...
## $ DEPRESSION : num [1:60962] 25.9 23.9 24 26.9 28.5 26.2 25.9 24.5 24.6 25.9 ...
## $ DIABETES : num [1:60962] 10.7 13.4 10.2 12.8 12.8 11.3 12.8 17.4 9.7 13.8 ...
## $ LPA : num [1:60962] 26.4 32.1 23 31.1 33.4 28.2 29.8 37.2 22.8 29.6 ...
## $ MHLTH : num [1:60962] 16.4 17.3 14 17.9 19.1 17 17 17.7 15.1 16.9 ...
## $ OBESITY : num [1:60962] 37 43.9 33.2 41.1 41 37.9 40.3 46.1 35.7 36.2 ...
## $ PHLTH : num [1:60962] 11.3 12.1 9.8 13.5 14.5 11.6 12.8 15.4 9.8 13 ...
## $ SLEEP : num [1:60962] 36.9 43.4 33.4 39.5 39.8 38.1 39.2 43.8 35.8 37.6 ...
## $ STROKE : num [1:60962] 3.1 3.7 3 3.8 4 3.3 3.8 5.3 2.7 4 ...
## $ mt.nev.mar : num [1:60962] 23.7 39.5 20.9 27.5 36.9 ...
## $ mt.now.mar : num [1:60962] 54.8 31.8 63.8 52.9 37 ...
## $ MT_Divorces : num [1:60962] 11.97 18.56 10.09 9.62 18.61 ...
## $ MT_Separated : num [1:60962] 1.39 2.62 0 1.17 1.61 ...
## $ MT_Widowed : num [1:60962] 8.11 7.54 5.27 8.87 5.9 ...
## $ ea.less.hs.deg : num [1:60962] 14.29 10.59 6.23 12.86 17.91 ...
## $ ea.hs.deg : num [1:60962] 33.8 50.3 28.1 31.2 38.4 ...
## $ ea.col.ass.deg : num [1:60962] 27 23.2 26.4 34.3 28 ...
## $ ea.ba.deg : num [1:60962] 15.4 12.9 22.8 12.2 10.8 ...
## $ ea.grad.prof.deg : num [1:60962] 9.4 2.97 16.43 9.5 4.93 ...
## $ MI_Estimate : num [1:60962] 26504 26173 37529 20669 24181 ...
## $ tot.pop : num [1:60962] 1941 1757 3539 3536 3562 ...
## $ CT_<10 : num [1:60962] 17.68 5.65 13.98 21.38 19.81 ...
## $ CT_10-14 : num [1:60962] 27.6 17.54 8.91 13.07 12.08 ...
## $ CT_15-19 : num [1:60962] 7.02 3.48 13.78 16.25 15.53 ...
## $ CT_20-24 : num [1:60962] 14.29 7.68 14.37 10.42 20.91 ...
## $ CT_25-29 : num [1:60962] 5.57 0.87 8.26 5.39 6.07 ...
## $ CT_30-34 : num [1:60962] 18.8 39.9 22.1 18.6 16.1 ...
## $ CT_35-44 : num [1:60962] 7.02 14.2 7.87 7.51 3.59 ...
## $ CT_45-59 : num [1:60962] 1.69 8.41 3.64 3.27 4.55 ...
## $ CT_>60 : num [1:60962] 0.363 2.319 7.087 4.152 1.38 ...
## $ ES_Total_labor_force : num [1:60962] 54.7 50.3 54.5 44.9 58.9 ...
## $ ES_Civilian_labor_force : num [1:60962] 54.7 49.4 54.2 43.5 58.3 ...
## $ ES_Civilian_labor_force_employed : num [1:60962] 53.5 47.4 52.9 42.3 53.5 ...
## $ ES_Civilian_labor_force_unemployed: num [1:60962] 1.16 2 1.27 1.21 4.78 ...
## $ ES_Armed_Forces : num [1:60962] 0 0.828 0.327 1.423 0.56 ...
## $ ES_Not_in_labor_force : num [1:60962] 45.3 49.7 45.5 55.1 41.1 ...
## $ land : num [1:60962] 3.79 1.29 2.46 3.1 8.65 ...
## $ urban : num [1:60962] 1594 2170 4386 3595 2505 ...
## $ poverty : num [1:60962] 11.34 17.88 2.85 21.58 30.5 ...
## $ no.ins : num [1:60962] 9.26 9.37 2.82 14.87 16.23 ...
## $ disab : num [1:60962] 17.6 17.1 19.6 25.7 24.7 ...
## $ no.comp : num [1:60962] 11.23 17.36 7.6 8.45 9.54 ...
## $ broad&comp : num [1:60962] 80.9 79.2 86 87.6 85.1 ...
## $ no.eng : num [1:60962] 1.83 0.56 0.98 0.68 0.99 0 0 0 0 0 ...
## $ sing.mom : num [1:60962] 14.12 17.25 7.69 9.38 27.48 ...
## $ live.alone : num [1:60962] 22.9 37.3 28.4 18.7 28.5 ...
## $ pub.assist : num [1:60962] 0 1.11 1.22 0.68 4.51 2.16 1.27 0.45 1.47 0 ...
## $ no.phone : num [1:60962] 2.88 2.36 2.5 0.45 0.78 0.69 0.49 4.91 0.88 0.71 ...
## $ no.plumb : num [1:60962] 0 2 0.86 3.47 0 ...
## $ married.kid : num [1:60962] 36 49.2 36.3 43.6 50.1 ...
## $ hhd.no.comp : num [1:60962] 18.3 27.8 10.4 11.6 13.1 ...
## $ hhd.only.phone : num [1:60962] 6.01 5.7 4.27 5.41 3.59 ...
## $ hhd.no.int : num [1:60962] 26.5 30.5 17.5 17.2 17.7 ...
## $ hhd.broad : num [1:60962] 59.7 55.1 74.6 65.4 66 ...
## $ index_apr20 : num [1:60962] 0.913 0.913 0.913 0.913 0.913 ...
The table displays summary statistics for the health and socioeconomic indicators in 2020. The mean values for several health indicators are high, for example depression (20.4%) and obesity (33.0%), indicating potential health concerns in the population. Education levels vary: on average, 12.2% have less than a high school degree and 28.8% have some college or an associate's degree. The mean poverty rate is 15.3%, and 9.1% of the population does not have health insurance. Additionally, 12.5% of households do not have a computer, and 28.1% of individuals live alone.
summary(num_df)
## ACCESS2 BINGE CHECKUP DEPRESSION DIABETES
## Min. : 2.6 Min. : 2.7 Min. :49.7 Min. : 8.5 Min. : 0.6
## 1st Qu.: 9.8 1st Qu.:14.7 1st Qu.:70.6 1st Qu.:17.8 1st Qu.: 8.5
## Median :13.3 Median :16.7 Median :74.8 Median :20.3 Median :10.3
## Mean :15.5 Mean :16.8 Mean :74.0 Mean :20.4 Mean :10.9
## 3rd Qu.:19.2 3rd Qu.:18.7 3rd Qu.:77.7 3rd Qu.:22.9 3rd Qu.:12.6
## Max. :64.9 Max. :36.6 Max. :90.8 Max. :37.8 Max. :42.2
##
## LPA MHLTH OBESITY PHLTH SLEEP
## Min. : 7.8 Min. : 6.1 Min. :11.5 Min. : 2.3 Min. :19.8
## 1st Qu.:18.7 1st Qu.:13.1 1st Qu.:28.0 1st Qu.: 8.4 1st Qu.:30.7
## Median :23.7 Median :15.0 Median :33.1 Median :10.3 Median :33.5
## Mean :24.6 Mean :15.1 Mean :33.0 Mean :10.8 Mean :34.0
## 3rd Qu.:29.5 3rd Qu.:16.9 3rd Qu.:37.7 3rd Qu.:12.7 3rd Qu.:36.7
## Max. :63.7 Max. :33.0 Max. :63.9 Max. :31.3 Max. :54.4
##
## STROKE mt.nev.mar mt.now.mar MT_Divorces MT_Separated
## Min. : 0.20 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 2.40 1st Qu.: 24.6 1st Qu.: 37.7 1st Qu.: 7.9 1st Qu.: 0.6
## Median : 3.00 Median : 31.5 Median : 48.2 Median : 10.7 Median : 1.5
## Mean : 3.19 Mean : 34.2 Mean : 46.7 Mean : 11.2 Mean : 1.9
## 3rd Qu.: 3.70 3rd Qu.: 41.4 3rd Qu.: 57.1 3rd Qu.: 13.9 3rd Qu.: 2.7
## Max. :20.50 Max. :100.0 Max. :100.0 Max. :100.0 Max. :41.8
## NA's :17 NA's :17 NA's :17 NA's :17
## MT_Widowed ea.less.hs.deg ea.hs.deg ea.col.ass.deg ea.ba.deg
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.8 1st Qu.: 4.9 1st Qu.:19.6 1st Qu.: 23.4 1st Qu.: 10.8
## Median : 5.6 Median : 9.4 Median :28.4 Median : 29.1 Median : 17.2
## Mean : 6.1 Mean :12.2 Mean :27.8 Mean : 28.8 Mean : 19.1
## 3rd Qu.: 7.7 3rd Qu.:16.6 3rd Qu.:36.1 3rd Qu.: 34.5 3rd Qu.: 26.3
## Max. :57.9 Max. :83.0 Max. :88.8 Max. :100.0 Max. :100.0
## NA's :17 NA's :25 NA's :25 NA's :25 NA's :25
## ea.grad.prof.deg MI_Estimate tot.pop CT_<10
## Min. : 0.0 Min. : 2499 Min. : 0 Min. : 0.0
## 1st Qu.: 4.7 1st Qu.: 25068 1st Qu.: 2805 1st Qu.: 6.0
## Median : 8.7 Median : 31500 Median : 3868 Median : 10.5
## Mean : 12.1 Mean : 34406 Mean : 4006 Mean : 13.5
## 3rd Qu.: 16.4 3rd Qu.: 40680 3rd Qu.: 5037 3rd Qu.: 17.4
## Max. :100.0 Max. :204750 Max. :39373 Max. :100.0
## NA's :25 NA's :71 NA's :93
## CT_10-14 CT_15-19 CT_20-24 CT_25-29 CT_30-34
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 7.9 1st Qu.: 9.8 1st Qu.: 9.2 1st Qu.: 3.2 1st Qu.: 8.2
## Median : 12.2 Median : 14.3 Median : 13.5 Median : 5.7 Median : 12.7
## Mean : 13.6 Mean : 15.2 Mean : 14.1 Mean : 6.3 Mean : 13.3
## 3rd Qu.: 17.9 3rd Qu.: 19.7 3rd Qu.: 18.3 3rd Qu.: 8.7 3rd Qu.: 17.6
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :64.6 Max. :100.0
## NA's :93 NA's :93 NA's :93 NA's :93 NA's :93
## CT_35-44 CT_45-59 CT_>60 ES_Total_labor_force
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.1 1st Qu.: 3.4 1st Qu.: 3.5 1st Qu.: 57.3
## Median : 6.0 Median : 6.9 Median : 7.0 Median : 63.7
## Mean : 6.8 Mean : 8.0 Mean : 9.3 Mean : 62.7
## 3rd Qu.: 9.6 3rd Qu.: 11.3 3rd Qu.: 12.6 3rd Qu.: 69.3
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :93 NA's :93 NA's :93 NA's :17
## ES_Civilian_labor_force ES_Civilian_labor_force_employed
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 57.0 1st Qu.: 53.1
## Median : 63.4 Median : 59.9
## Mean : 62.2 Mean : 58.7
## 3rd Qu.: 69.0 3rd Qu.: 65.6
## Max. :100.0 Max. :100.0
## NA's :17 NA's :17
## ES_Civilian_labor_force_unemployed ES_Armed_Forces ES_Not_in_labor_force
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1.8 1st Qu.: 0.0 1st Qu.: 30.7
## Median : 3.0 Median : 0.0 Median : 36.3
## Mean : 3.6 Mean : 0.4 Mean : 37.3
## 3rd Qu.: 4.6 3rd Qu.: 0.0 3rd Qu.: 42.7
## Max. :33.4 Max. :99.4 Max. :100.0
## NA's :17 NA's :17 NA's :17
## land urban poverty no.ins disab
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1 1st Qu.: 3 1st Qu.: 6.6 1st Qu.: 4.1 1st Qu.: 9.3
## Median : 2 Median : 2989 Median : 12.1 Median : 7.3 Median : 12.6
## Mean : 47 Mean : 2793 Mean : 15.3 Mean : 9.1 Mean : 13.5
## 3rd Qu.: 10 3rd Qu.: 4368 3rd Qu.: 20.9 3rd Qu.:12.3 3rd Qu.: 16.7
## Max. :85426 Max. :31777 Max. :100.0 Max. :74.7 Max. :100.0
## NA's :18 NA's :18 NA's :146 NA's :29 NA's :111
## no.comp broad&comp no.eng sing.mom
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.4 1st Qu.: 76.4 1st Qu.: 0.0 1st Qu.: 7.3
## Median : 6.8 Median : 85.0 Median : 1.6 Median : 11.3
## Mean : 8.5 Mean : 82.6 Mean : 4.6 Mean : 13.3
## 3rd Qu.: 11.6 3rd Qu.: 91.6 3rd Qu.: 5.4 3rd Qu.: 17.3
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :164 NA's :164 NA's :164
## live.alone pub.assist no.phone no.plumb married.kid
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 20.4 1st Qu.: 0.8 1st Qu.: 0.8 1st Qu.: 0.0 1st Qu.: 33.4
## Median : 27.1 Median : 1.8 Median : 1.7 Median : 0.9 Median : 40.7
## Mean : 28.1 Mean : 2.7 Mean : 2.2 Mean : 2.4 Mean : 41.2
## 3rd Qu.: 34.6 3rd Qu.: 3.6 3rd Qu.: 3.1 3rd Qu.: 3.0 3rd Qu.: 48.8
## Max. :100.0 Max. :53.5 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :171 NA's :164 NA's :163 NA's :213
## hhd.no.comp hhd.only.phone hhd.no.int hhd.broad
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 5.9 1st Qu.: 2.1 1st Qu.: 8.7 1st Qu.: 54.1
## Median : 10.9 Median : 4.7 Median : 15.6 Median : 68.0
## Mean : 12.5 Mean : 5.9 Mean : 17.4 Mean : 65.7
## 3rd Qu.: 17.2 3rd Qu.: 8.4 3rd Qu.: 23.8 3rd Qu.: 79.4
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :164 NA's :164 NA's :164
## index_apr20
## Min. :0.360
## 1st Qu.:0.855
## Median :0.885
## Mean :0.883
## 3rd Qu.:0.916
## Max. :1.194
##
Checking for null values confirms that the data set contains missing values, as the following table shows. Because they are few relative to the overall data, we remove them all with the “na.omit” function, which drops every row that has a missing value in any column. The resulting dataset contains only rows with complete data.
colSums(is.na(num_df))
## ACCESS2 BINGE
## 0 0
## CHECKUP DEPRESSION
## 0 0
## DIABETES LPA
## 0 0
## MHLTH OBESITY
## 0 0
## PHLTH SLEEP
## 0 0
## STROKE mt.nev.mar
## 0 17
## mt.now.mar MT_Divorces
## 17 17
## MT_Separated MT_Widowed
## 17 17
## ea.less.hs.deg ea.hs.deg
## 25 25
## ea.col.ass.deg ea.ba.deg
## 25 25
## ea.grad.prof.deg MI_Estimate
## 25 71
## tot.pop CT_<10
## 0 93
## CT_10-14 CT_15-19
## 93 93
## CT_20-24 CT_25-29
## 93 93
## CT_30-34 CT_35-44
## 93 93
## CT_45-59 CT_>60
## 93 93
## ES_Total_labor_force ES_Civilian_labor_force
## 17 17
## ES_Civilian_labor_force_employed ES_Civilian_labor_force_unemployed
## 17 17
## ES_Armed_Forces ES_Not_in_labor_force
## 17 17
## land urban
## 18 18
## poverty no.ins
## 146 29
## disab no.comp
## 111 164
## broad&comp no.eng
## 164 164
## sing.mom live.alone
## 164 164
## pub.assist no.phone
## 171 164
## no.plumb married.kid
## 163 213
## hhd.no.comp hhd.only.phone
## 164 164
## hhd.no.int hhd.broad
## 164 164
## index_apr20
## 0
num_df <- na.omit(num_df)
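As a small illustration of what `na.omit` does to a data frame, the sketch below uses a made-up four-row table (the name `toy` and its values are hypothetical). It shows the per-column NA counts, the row-wise removal, and the `na.action` attribute that records which rows were dropped.

```r
# Toy data with one NA in each column, on different rows
toy <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))

# NA count per column, as in the colSums(is.na(...)) table
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows that are fully observed
complete_share <- mean(complete.cases(toy))
print(complete_share)

# na.omit drops any row with at least one NA in any column
clean <- na.omit(toy)
print(nrow(clean))

# The indices of the dropped rows are stored in the "na.action" attribute
dropped <- attr(clean, "na.action")
print(as.integer(dropped))
```

Because removal is row-wise, columns with NAs on different rows compound: here two columns with one NA each cost half of the rows.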
Next, we handle outliers in the numerical data. We apply the outlierKD2 function from the “ezids” package to detect and remove outliers, looping through all columns and calling outlierKD2 on each one. After each call, we remove any resulting NAs and drop the extra trailing columns that outlierKD2 adds before moving on to the next column. Handling outliers is essential for building an accurate model because outliers can significantly affect the results, especially in regression models, and can lead to poor performance and inaccurate predictions.
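# Remove outliers in the first column, then drop the last two columns added by outlierKD2
new_num_df <- outlierKD2(num_df, num_df[[1]], rm = TRUE, boxplt = FALSE, histogram = FALSE, qqplt = FALSE)
new_num_df <- new_num_df[, 1:(ncol(new_num_df) - 2)]
# Loop through the remaining columns
for (col_name in colnames(num_df)[2:ncol(num_df)]) {
  # Remove outliers in this column
  new_num_df <- outlierKD2(new_num_df, new_num_df[[col_name]], rm = TRUE, boxplt = FALSE, histogram = FALSE, qqplt = FALSE)
  # Drop the rows that now contain NAs
  new_num_df <- na.omit(new_num_df)
  # Drop the last two columns again
  new_num_df <- new_num_df[, 1:(ncol(new_num_df) - 2)]
}
# Drop the final trailing column
new_num_df <- new_num_df[, 1:(ncol(new_num_df) - 1)]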
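For readers without the “ezids” package, the same idea can be sketched in base R. This is not the outlierKD2 implementation; it assumes the conventional boxplot rule (values more than 1.5 times the interquartile range beyond the quartiles count as outliers) and that missing values were already removed. The function name `remove_outliers_iqr` and the toy data are hypothetical.

```r
# Column-by-column outlier removal with the 1.5 * IQR rule (assumed rule, not outlierKD2)
remove_outliers_iqr <- function(df, k = 1.5) {
  for (col in names(df)) {
    # Quartiles are recomputed on the rows that survived earlier columns
    q <- quantile(df[[col]], c(0.25, 0.75))
    fence <- k * (q[[2]] - q[[1]])
    keep <- df[[col]] >= q[[1]] - fence & df[[col]] <= q[[2]] + fence
    df <- df[keep, , drop = FALSE]
  }
  df
}

# Toy data: 100 is an obvious outlier in x
toy <- data.frame(x = c(1, 2, 3, 4, 5, 100), y = c(10, 11, 12, 13, 14, 15))
cleaned <- remove_outliers_iqr(toy)
print(cleaned)
```

Filtering sequentially means each column's fences depend on the rows removed by earlier columns, so the result can depend on column order. With many columns this approach can also discard a large share of the rows, which is worth checking against the original row count.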
The box plots show the distribution of various health, demographic, and socioeconomic factors. The plots suggest that obesity, diabetes, and depression rates are relatively high, while education levels and median household income are moderate. After outlier removal, the distributions of all variables appear relatively normal.
# Arrange the box plots in an 8 x 8 grid with narrow margins
par(mfrow = c(8, 8))
par(mar = c(1, 1, 1, 1))
for (i in 1:ncol(new_num_df)) {
  boxplot(new_num_df[[i]], main = names(new_num_df)[i])
}
We create scatter plots to visualize the relationship between depression and the other variables, and between poor mental health and the other variables. The scatter plots show the distribution of each variable and the trend of the relationship between the variables. The data used in the plots come from a survey that collected information on various demographic, socioeconomic, and health-related factors.
Based on the scatter plots for depression, depression appears to be positively associated with “OBESITY,” “PHLTH,” and “ea.hs.deg.” On the other hand, there seems to be a negative association between depression and “MI_Estimate,” “ea.grad.prof.deg,” “CT_>60,” and “hhd.broad,” indicating that as these variables increase, the level of depression decreases. The strength of these associations varies, with “OBESITY,” “PHLTH,” and “MI_Estimate” showing relatively stronger correlations with depression, while “ea.hs.deg” and “ea.grad.prof.deg” show weaker ones.
# Scatter plots with a fitted line for DEPRESSION against every other variable
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "DEPRESSION") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$DEPRESSION)
    plot <- ggplot(plot_data, aes(x = x, y = y)) +
      geom_point(size = 0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)
The scatter plots with a fitted line for poor mental health (MHLTH) and the above highly correlated variables show that MHLTH positively correlates with PHLTH, OBESITY, and poverty. The scatter plot for PHLTH and MHLTH shows a clear positive linear relationship.
# Scatter plots with a fitted line for MHLTH against every other variable
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "MHLTH") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$MHLTH)
    plot <- ggplot(plot_data, aes(x = x, y = y)) +
      geom_point(size = 0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)
Next, we examine the distributions of the dependent variables, depression rate and poor mental health rate, to confirm which model would be appropriate for further analysis. Both depression and poor mental health rates are bell-shaped, indicating an approximately normal distribution. This was confirmed through histograms and Q-Q plots, with both variables showing a close-to-straight-line pattern. Approximately normal dependent variables are a reasonable starting point for linear regression, although the formal normality assumption concerns the model residuals rather than the dependent variable itself. On this basis, we use linear regression to explore the relationship between the dependent variables and the independent variables. Linear regression can provide valuable insight into potential risk factors for depression and poor mental health, allowing for the development of targeted interventions to improve mental health outcomes.
# Histograms
hist(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
hist(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
# QQ plots for the distribution of tract-level depression rates and tract-level poor mental health rates.
qqnorm(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
qqline(new_num_df$DEPRESSION)
qqnorm(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
qqline(new_num_df$MHLTH)
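A visual Q-Q check can be complemented with a single number: the correlation between the sample quantiles and the theoretical normal quantiles, which is close to 1 when the points lie near a straight line. The sketch below runs on simulated data, not on the project data, and the variable names are hypothetical.

```r
set.seed(1)

# Simulated stand-in for a tract-level rate (not the project data)
sim_rate <- rnorm(500, mean = 20, sd = 4)

# qqnorm returns the theoretical (x) and sample (y) quantiles without plotting
qq <- qqnorm(sim_rate, plot.it = FALSE)

# Correlation between the two sets of quantiles
qq_cor <- cor(qq$x, qq$y)
print(round(qq_cor, 4))
```

This is an informal summary rather than a formal test; it is most useful for comparing several variables side by side when inspecting dozens of Q-Q plots is impractical.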
We begin selecting variables for our multiple linear regressions by generating correlation matrices for all of our health and socioeconomic variables. Doing this enables us to keep only the variables that are meaningfully associated with depression or poor mental health rates. We use Pearson’s correlation coefficient with an absolute-value cutoff of 0.35 for depression and 0.5 for poor mental health. Numerous moderate associations are revealed by doing this. However, it is notable that the associations with poor mental health are stronger than the associations with depression. The variables associated with depression all have absolute correlation coefficients of 0.35 or greater, while those associated with poor mental health all have absolute correlation coefficients of 0.55 or greater. Both health conditions are associated with some combination of health and socioeconomic variables.
# Create a correlation matrix
cor_matrix <- cor(new_num_df)
# Create two lists with the names of the highly correlated variables
# (|r| > 0.35 for DEPRESSION, |r| > 0.5 for MHLTH)
high_dep_cor_list <- names(which(cor_matrix["DEPRESSION",] > 0.35 | cor_matrix["DEPRESSION",] < -0.35))
high_mhlth_cor_list <- names(which(cor_matrix["MHLTH",] > 0.5 | cor_matrix["MHLTH",] < -0.5))
high_dep_cor_list <- high_dep_cor_list[high_dep_cor_list != "MHLTH"]
high_mhlth_cor_list <- high_mhlth_cor_list[high_mhlth_cor_list != "DEPRESSION"]
# Create two correlation matrixes from new_num_df in the above lists
high_dep_cor_mat <- cor(new_num_df[high_dep_cor_list])
high_mhlth_cor_mat <- cor(new_num_df[high_mhlth_cor_list])
# Plot above correlation matrix
corrplot(high_dep_cor_mat, type = "lower",
  col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),
  title = "Highly Correlated Variables with DEPRESSION", mar = c(0, 0, 2, 0))
corrplot(high_mhlth_cor_mat, type = "lower",
  col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),
  title = "Highly Correlated Variables with MHLTH", mar = c(0, 0, 2, 0))
my_col <- colorRampPalette(c("red", "white", "blue"))(30)
corrplot(high_dep_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))
corrplot(high_mhlth_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))
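The threshold-based selection can be wrapped in a small helper that works on the absolute correlation. The sketch below is an illustration on simulated data; `select_correlated` and the toy columns are hypothetical names, and the helper uses `>=` where the report's code uses strict inequalities.

```r
# Return the names of columns whose |Pearson r| with the target meets the threshold
select_correlated <- function(df, target, threshold) {
  r <- cor(df)[target, ]
  keep <- abs(r) >= threshold & names(r) != target
  names(r)[keep]
}

set.seed(42)
n <- 200
x1 <- rnorm(n)

# y depends strongly on x1; noise is unrelated to y
toy <- data.frame(
  y = x1 + rnorm(n, sd = 0.5),
  x1 = x1,
  noise = rnorm(n)
)

selected <- select_correlated(toy, "y", 0.35)
print(selected)
```

Filtering on marginal correlation is simple and transparent, but it can miss variables that only matter jointly with others, which is one reason to follow it with a multicollinearity check and model comparison.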
Next, we check the selected variables for multicollinearity by using a Variance Inflation Factor (VIF) test. If several independent variables are strongly multicollinear, we cannot use them together in our multiple regressions. Fortunately, all of our selected independent variables have VIF values below 5, except for LPA in the poor mental health set (VIF of 5.02), which is only marginally above that threshold. This means that there is little multicollinearity between our independent variables, and we can use them all in our subsequent multiple regressions.
calculate_VIF <- function(data, target_col) {
  # Regress the target on every other column; each predictor's VIF
  # depends only on the other predictors, not on the response
  fit <- lm(reformulate(".", response = target_col), data = data)
  data.frame(
    Feature = setdiff(colnames(data), target_col),
    VIF = unname(vif(fit))
  )
}
depression_VIF <- calculate_VIF(new_num_df[high_dep_cor_list], "DEPRESSION")
mhlth_VIF <- calculate_VIF(new_num_df[high_mhlth_cor_list], "MHLTH")
depression_VIF
## Feature VIF
## 1 OBESITY 1.97
## 2 PHLTH 2.87
## 3 ea.hs.deg 2.57
## 4 ea.grad.prof.deg 2.71
## 5 MI_Estimate 2.65
## 6 CT_>60 1.14
## 7 hhd.broad 1.98
mhlth_VIF
## Feature VIF
## 1 LPA 5.02
## 2 OBESITY 2.07
## 3 PHLTH 4.33
## 4 SLEEP 2.11
## 5 ea.hs.deg 3.74
## 6 ea.ba.deg 3.68
## 7 ea.grad.prof.deg 2.70
## 8 MI_Estimate 2.83
## 9 poverty 1.78
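To make the VIF figures easier to interpret, the sketch below computes VIF by hand from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. It uses simulated predictors with syntactic names; `manual_vif` and the toy columns are hypothetical and do not come from the project data or the “car” package.

```r
# VIF from its definition: regress each predictor on the others
manual_vif <- function(X) {
  sapply(names(X), function(col) {
    others <- setdiff(names(X), col)
    fit <- lm(reformulate(others, response = col), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(7)
x_a <- rnorm(300)
X <- data.frame(
  x_a = x_a,
  x_b = x_a + rnorm(300, sd = 0.5),  # strongly related to x_a
  x_c = rnorm(300)                   # unrelated to the others
)

vifs <- manual_vif(X)
print(round(vifs, 2))
```

A VIF of 5 corresponds to R_j^2 = 0.8, meaning 80% of that predictor's variance is explained by the other predictors, which is why 5 is a common rule-of-thumb cutoff.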
After going through this initial testing, we are left with seven variables for our depression regression [Obesity (%), Poor Physical Health >= 14 days (%), (Last Academic Record) High School Graduate (%), (Academic Record) Graduate or Professional Degree (%), Median Income (US$), Commute Time >= 60 mins (%), and Broadband Access (%)] and nine variables for our poor mental health regression [Lack of Physical Activity (%), Obesity (%), Poor Physical Health >= 14 days (%), Poor Sleep >= 14 days (%), (Last Academic Record) High School Graduate (%), (Academic Record) Bachelor's Degree (%), (Academic Record) Graduate or Professional Degree (%), Median Income (US$), and Below Poverty Level (%)].
dep_features_list <- high_dep_cor_list[high_dep_cor_list != "DEPRESSION"]
mhlth_features_list <- high_mhlth_cor_list[high_mhlth_cor_list != "MHLTH"]
dep_features_list
## [1] "OBESITY" "PHLTH" "ea.hs.deg" "ea.grad.prof.deg"
## [5] "MI_Estimate" "CT_>60" "hhd.broad"
mhlth_features_list
## [1] "LPA" "OBESITY" "PHLTH" "SLEEP"
## [5] "ea.hs.deg" "ea.ba.deg" "ea.grad.prof.deg" "MI_Estimate"
## [9] "poverty"
At this point, we split our initial data set into training and testing data sets. We “train” our multiple linear regressions and regression trees on this training set. We then test whether our models are over or underfit by feeding the testing data set into them and comparing the results to the output we got from feeding the initial training data into them.
library(caret)
set.seed(123)
fold <- floor(runif(nrow(new_num_df),1,11))
new_num_df$fold <- fold
test.set <- new_num_df[new_num_df$fold == 1,]
train.set <- new_num_df[new_num_df$fold != 1,]
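The fold-based split can be checked on a toy table. The sketch below reuses the same `floor(runif(n, 1, 11))` idea on 1,000 made-up rows; `toy`, `test_toy`, and `train_toy` are hypothetical names.

```r
set.seed(123)

# Toy table standing in for new_num_df
toy <- data.frame(id = 1:1000)

# Assign each row a fold from 1 to 10 with roughly equal probability
toy$fold <- floor(runif(nrow(toy), 1, 11))

# Fold 1 becomes the test set; the other nine folds form the training set
test_toy <- toy[toy$fold == 1, ]
train_toy <- toy[toy$fold != 1, ]

print(table(toy$fold))
print(c(train = nrow(train_toy), test = nrow(test_toy)))
```

Two points are worth keeping in mind with this approach: the split is roughly, not exactly, 90/10, and the `fold` column stays in the data frame, so it should be excluded from any model formula that uses `.` to mean "all other columns".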
We experiment with three different multiple linear regressions and compare their performance using Analysis of Variance (ANOVA) testing. The first of these regressions (Model 1) predicts depression and poor mental health rates by incorporating all of the highly correlated variables (for each condition) into the regressions. The second regression (Model 2) uses only a cluster of highly correlated health variables to predict depression and poor mental health rates. Finally, the third regression (Model 3) uses a cluster of highly correlated socioeconomic variables to predict depression and poor mental health rates.
Naturally, Model 1 performs the best overall, since it incorporates the largest number of highly correlated variables. It has a lower residual sum of squares (RSS) than the other models, indicating that it fits the training data best and leaves the least unexplained variation in the dependent variable. In other words, Model 1 can explain more of the variation in depression and mental health rates than the other two models.
Nevertheless, we are more interested in comparing the performance of Model 2 to Model 3. This comparison shows that Model 2 (the health variables model) performs slightly better than Model 3 (the socioeconomic variables model). For depression, their RSS values are close to one another (77498 versus 77847), and Model 2 has more residual degrees of freedom, meaning it uses fewer variables in the regression. Achieving similar predictive accuracy with fewer variables suggests that the health variables include some key predictors. The ANOVA table shows that the difference between Model 1 and Model 2 is statistically significant at the 1% level (p-value less than 0.01). Model 2 and Model 3 are not nested, and the table reports no F statistic (NA) for that pair, so their comparison rests on RSS and the number of variables. In sum, although it is better to incorporate both the highly correlated health variables and the highly correlated socioeconomic variables, the health variables seem more predictive than the socioeconomic variables.
# Compare the three models for each outcome with an ANOVA (F-test) table
depression_anova_model <- anova(depression_model_1, depression_model_2, depression_model_3)
mhlth_anova_model <- anova(mhlth_model_1, mhlth_model_2, mhlth_model_3)
xkabledply(depression_anova_model)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 11204 | 69606 | NA | NA | NA | NA |
| 11209 | 77498 | -5 | -7893 | 254 | 0 |
| 11206 | 77847 | 3 | -349 | NA | NA |
xkabledply(mhlth_anova_model)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 11202 | 11872 | NA | NA | NA | NA |
| 11207 | 15343 | -5 | -3471 | 655 | 0 |
| 11206 | 16325 | 1 | -982 | NA | NA |
# health.socioecon.dep.anova <- anova(depression_model_2, depression_model_3)
# health.socioecon.mhlth.anova <- anova(mhlth_model_2, mhlth_model_3)
#
# xkabledply(health.socioecon.dep.anova)
# xkabledply(health.socioecon.mhlth.anova)
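The F values in the second row of each table can be reproduced from the other columns: the change in the sum of squares per degree of freedom, divided by the residual mean square of the largest model (Model 1). The sketch below redoes that arithmetic with the rounded values copied from the tables, so small rounding differences are possible; `nested_f` is a hypothetical helper name.

```r
# Hypothetical helper: F statistic for a nested-model comparison
nested_f <- function(sum_sq, df_diff, rss_full, res_df_full) {
  (abs(sum_sq) / abs(df_diff)) / (rss_full / res_df_full)
}

# Depression: Model 1 (full) vs Model 2, values copied from the table above
f_dep <- nested_f(sum_sq = -7893, df_diff = -5,
                  rss_full = 69606, res_df_full = 11204)

# Poor mental health: Model 1 (full) vs Model 2
f_mhlth <- nested_f(sum_sq = -3471, df_diff = -5,
                    rss_full = 11872, res_df_full = 11202)

print(round(c(depression = f_dep, mhlth = f_mhlth)))  # 254 and 655, matching the tables
```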
In the initial model, we included all variables with high correlation coefficients. We then ran a stepwise selection with step() to identify a smaller combination of features. The selection criterion penalizes the number of parameters in the model, which helps to prevent overfitting and avoid overly complex models. Note that the penalty used here, k = log(n), corresponds to BIC rather than the default AIC penalty of k = 2, so it favors smaller models. The stepwise procedure retained a smaller set of predictor variables for both depression and poor mental health rates.
# Perform stepwise selection; k = log(n) applies the BIC penalty (k = 2 would give AIC)
depression_stepwise_model <- step(depression_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))
mhlth_stepwise_model <- step(mhlth_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))
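In `step()`, the `k` argument sets the penalty charged per model parameter: `k = 2` (the default) corresponds to AIC, while `k = log(n)` corresponds to BIC. The following sketch contrasts the two on the built-in `mtcars` data, which is used only for illustration and is an assumption, not part of the project.

```r
full_fit <- lm(mpg ~ ., data = mtcars)   # built-in data, for illustration only
n <- nrow(mtcars)

# k = 2 is the default penalty and corresponds to AIC
aic_fit <- step(full_fit, direction = "both", trace = 0, k = 2)

# k = log(n) corresponds to BIC; it exceeds 2 once n is at least 8,
# so each extra term costs more than under AIC
bic_fit <- step(full_fit, direction = "both", trace = 0, k = log(n))

print(formula(aic_fit))
print(formula(bic_fit))
```

Because the BIC penalty grows with the sample size, it tends to retain fewer terms than AIC on large data sets such as the tract-level training set.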
The stepwise models have fewer variables and perform similarly to the initial models when compared using an ANOVA test: the residual sums of squares (RSS) are almost identical, and the differences are not statistically significant (p = 0.118 for depression and p = 0.282 for poor mental health). Therefore, we chose the stepwise models over the initial models because they are more parsimonious.
dep_anova_model_2 <- anova(depression_model_1, depression_stepwise_model)
mhlth_anova_model_2 <- anova(mhlth_model_1, mhlth_stepwise_model)
xkabledply(dep_anova_model_2)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 11204 | 69606 | NA | NA | NA | NA |
| 11206 | 69632 | -2 | -26.5 | 2.13 | 0.118 |
xkabledply(mhlth_anova_model_2)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 11202 | 11872 | NA | NA | NA | NA |
| 11203 | 11873 | -1 | -1.23 | 1.16 | 0.282 |
This section summarizes the models we consider best performing based on the model selection so far.
For depression rates, our model identified five significant predictor variables: obesity rate (OBESITY), poor physical health (PHLTH), high school graduate as the highest level of educational attainment (ea.hs.deg), commute time of 60 minutes or more (CT_>60), and households with broadband internet access (hhd.broad). The model explained 37.6% of the variability in depression levels, which is a moderate amount.
For poor mental health rates, our model identified seven significant predictor variables: “LPA”, “Obesity”, “PHLTH”, “Sleep”, “ea.graduate”, “MI_income”, and “poverty”. The model explained 67.6% of the variability in poor mental health levels, indicating a strong relationship between the predictor variables and poor mental health.
Overall, our findings suggest that demographic, socio-economic, and health-related variables play a significant role in predicting depression and poor mental health rates in the United States. Our models can serve as a basis for further research on this topic and can assist policymakers in identifying high-risk populations and designing targeted interventions.
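# Predict depression rates for the test set with the selected (stepwise) model
dep_predictions <- predict(depression_stepwise_model, newdata = test.set)
# R-squared from summary() is the in-sample value computed on the training data
dep_r_squared <- summary(depression_stepwise_model)$r.squared
# Mean squared error (MSE) on the held-out test set
dep_mse <- mean((test.set$DEPRESSION - dep_predictions)^2)
# Print the results
cat("R-squared:", dep_r_squared, "\n")
cat("MSE:", dep_mse, "\n")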
# Predict poor mental health rates for the test set with the stepwise model
mhlth_predictions <- predict(mhlth_stepwise_model, newdata = test.set)
# R-squared from summary() is the in-sample value computed on the training data
mhlth_r_squared <- summary(mhlth_stepwise_model)$r.squared
# Mean squared error (MSE) on the held-out test set
mhlth_mse <- mean((test.set$MHLTH - mhlth_predictions)^2)
# Print the results
cat("R-squared:", mhlth_r_squared, "\n")
cat("MSE:", mhlth_mse, "\n")
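The R-squared values read from `summary()` describe the fit on the training data, whereas the MSE is computed on the test set. To report both on the same held-out footing, an out-of-sample R-squared can be computed directly from the test predictions. The sketch below shows one way to do this; `eval_on_test` is a hypothetical helper, and the built-in `mtcars` data with a simple deterministic split is assumed only for illustration.

```r
# Hypothetical helper: test-set MSE, RMSE and out-of-sample R-squared
eval_on_test <- function(model, test_df, outcome) {
  pred   <- predict(model, newdata = test_df)
  actual <- test_df[[outcome]]
  sse <- sum((actual - pred)^2)            # squared prediction errors
  sst <- sum((actual - mean(actual))^2)    # variation around the test-set mean
  mse <- sse / length(actual)
  c(MSE = mse, RMSE = sqrt(mse), R2 = 1 - sse / sst)
}

# Illustration on built-in data: every fourth row is held out
test_idx <- seq(1, nrow(mtcars), by = 4)
toy_fit  <- lm(mpg ~ wt + hp, data = mtcars[-test_idx, ])
toy_metrics <- eval_on_test(toy_fit, mtcars[test_idx, ], "mpg")
print(round(toy_metrics, 3))
```

Applied to a fitted model and `test.set`, the same helper would give figures that are directly comparable across the regressions and the regression trees.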
We then generate our regression trees using our training data set. Because we do not preemptively select highly correlated variables (as we did when building the multiple regression models), we see a greater variety of predictive independent variables emerge. These include the economic index showing the effect of COVID, binge drinking (%), no proficient English speakers in households (%), never married (%), married with kid (%), attainment of graduate or professional degree (%), and having gotten a recent health checkup (%).
# Building a Regression Tree
# Step 1. Use recursive binary splitting to grow a large tree of depression on the training data
train.dep.tree <- rpart(DEPRESSION ~ ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set, method = "anova")
# Grow large tree of mhlth on the training data
train.mhlth.tree <- rpart(MHLTH ~ ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set)
# Plot tree model
fancyRpartPlot(train.dep.tree, main = "Depression Rate Tree")
fancyRpartPlot(train.mhlth.tree, main = "Poor Mental Health Rate Tree")
# Print summary
summary(train.dep.tree)
summary(train.mhlth.tree)
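The trees above are grown with the `rpart` defaults and are not pruned afterwards. As a minimal sketch of how cost-complexity pruning could be applied, the example below grows a deliberately large tree, picks the complexity parameter with the lowest cross-validated error from the `cptable`, and prunes back to it. It uses the built-in `mtcars` data as an assumption for illustration; the control settings are arbitrary choices, not values from the project.

```r
library(rpart)

set.seed(123)  # the cross-validation inside rpart is random
# Grow a deliberately large tree on built-in data (illustration only)
big_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(minsplit = 5, cp = 0.001))

# Pick the complexity parameter with the lowest cross-validated error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the corresponding subtree
pruned_tree <- prune(big_tree, cp = best_cp)
print(c(leaves_before = sum(big_tree$frame$var == "<leaf>"),
        leaves_after  = sum(pruned_tree$frame$var == "<leaf>")))
```

The same two steps (inspect `cptable`, then call `prune()`) would apply to `train.dep.tree` and `train.mhlth.tree`.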
Our depression rate regression tree has an R-squared value of 0.348 and a mean squared error (MSE) of 6.12, where the MSE is the average squared difference between the actual and predicted values of the dependent variable. This indicates that the tree explains approximately 35 percent of the variation in the depression rate. On the other hand, our mental health rate tree has an R-squared value of 0.584 and an MSE of 1.35. So in comparison to one another, the mental health regression tree performs better.
However, by these metrics, both regression trees perform worse than their corresponding multiple linear regressions. The depression multiple linear regression has an R-squared value of 0.376 and an MSE of 5.78, compared with 0.348 and 6.12 for the tree. The poor mental health multiple linear regression has an R-squared value of 0.676, as reported in the model summary, against 0.584 for the tree, with a reported MSE of 1.35.
# Generate predicted values for test set
test.set$pred.depression <- predict(train.dep.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$DEPRESSION - test.set$pred.depression)^2)
R2 <- 1 - sum((test.set$DEPRESSION - test.set$pred.depression)^2) / sum((test.set$DEPRESSION - mean(test.set$DEPRESSION))^2)
# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")
# The R-squared value of 0.399 suggests that the regression tree model explains 39.9% of the variation in the data, which is a moderate level of explanation.
# The MSE (Mean Squared Error) of 6.05 represents the average squared difference between the predicted values and the actual values, with a higher value indicating worse performance.
test.set$pred.mhlth <- predict(train.mhlth.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$MHLTH - test.set$pred.mhlth)^2)
R2 <- 1 - sum((test.set$MHLTH - test.set$pred.mhlth)^2) / sum((test.set$MHLTH - mean(test.set$MHLTH))^2)
# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")
# The regression tree model with MHLTH as the target variable has an R-squared value of 0.685, indicating that the model explains 68.5% of the variability in MHLTH.
# The mean squared error (MSE) is 1.07, the average squared difference between the predicted and actual values (in squared units of MHLTH).
# The root mean squared error (RMSE) is 1.04, which is in the same units as MHLTH and approximates the standard deviation of the prediction errors.
# In other words, the model's predictions are typically off by about 1.04 units. Overall, these metrics suggest that the model has a moderate level of predictive power for the MHLTH variable.
Based on the tests and modeling performed previously, the factors that are highly associated with depression and poor mental health rates are:
- Depression: obesity, poor physical health, high school degree as the highest level of educational attainment, commute time >= 60 min, households without high-speed wireless internet, binge drinking, the economic effect of COVID-19, and no English.
- Poor Mental Health: lack of physical activity, obesity, poor physical health, lack of sleep, BA as the highest level of educational attainment, graduate or professional degree as the highest level of educational attainment, never married, married with a kid, recent checkup, median income, poverty rate, and binge drinking.

We can infer depression and poor mental health rates in a given tract based on the above factors with some accuracy. For depression, the variation of these factors is moderately associated (R-squared: 0.37) with the variation in depression rate. For poor mental health, the variation of these factors is strongly associated (R-squared: 0.69) with the variation in poor mental health rate.
In this section, we discuss the limitations of our research project and propose potential areas for further research.
Based on the information we provided earlier, there are several limitations to this project:
- The measure of poor mental state over the last two weeks may not be precise because the data were combined from different datasets that were gathered more than two weeks apart. This could introduce measurement bias and affect the accuracy of the results.
- The project did not prune or tune the regression trees, which could have resulted in less accurate models.
- The feature selection process used in the analysis may not have been optimal, which could result in irrelevant or redundant variables being included in the model, reducing its predictive power.
- The data used in the analysis may represent only part of the population, as they were collected from a specific geographic region or sample.
- The project was limited to some geographical regions, which may not be representative of the broader population and may have limited the ability to identify additional factors associated with depression and poor mental health.
- The project did not account for potential confounding variables, such as cultural or social factors that could influence the relationship between the identified factors and depression or poor mental health. This could lead to overestimating or underestimating the actual effect of these factors and may be essential to consider in future research.
Some potential further research directions include:
- Conduct micro-level analysis: Further research could explore the social systems and relationships contributing to depression. This could involve examining individual-level factors, such as social support, family dynamics, and work stress, and how they impact mental health.
- Integrate genetic information and other biological measures, such as hormonal imbalances and brain imaging. There is evidence that genetics play a role in the development of depression. Further research could examine genetic information and other biological factors related to depression.
- Apply additional feature selection methods to identify the most important predictors of depression and poor mental health. As mentioned earlier, the feature selection process used in this project was limited. Further research could involve more sophisticated methods, such as principal component analysis or regularized regression techniques.
- Explore additional factors: The current analysis focused on a limited set of factors in the dataset. Future research could explore additional factors related to depression, such as regular diet and lifestyle choices, and investigate their relationship to depression and poor mental health rates.
- Time-series analysis: The current analysis did not consider the temporal aspect of depression and mental health rates. Future research could incorporate time-series analysis to examine how depression and mental health rates change over time and identify seasonal or cyclical trends that may shed light on environmental or societal changes.
- Conduct qualitative research, such as interviews and focus groups, to better understand the subjective experiences of individuals living with depression and gain insights into potential risk factors that quantitative measures may not capture.
- Explore the potential role of social support networks, such as family and friends, in mitigating or exacerbating depression and poor mental health, and develop interventions to strengthen social support.
In conclusion, this research project aimed to identify the factors associated with depression and poor mental health rates in the United States. The study utilized multiple linear regression models, regression trees, and ANOVA testing to analyze demographic, socioeconomic, and health-related variables. The results indicated that factors such as obesity, poor physical health, high school degree as the highest educational attainment, lack of physical activity, lack of sleep, and binge drinking were significantly associated with depression and poor mental health rates. Additionally, the project identified several limitations, including potential measurement bias, incomplete feature selection, and geographical restrictions, which may impact the accuracy of the results. The study contributes to the growing body of literature on the social determinants of mental health and provides a basis for further research and policy interventions. The findings can assist policymakers in identifying high-risk populations and designing targeted interventions. Future research directions include exploring the role of social support networks in mitigating or exacerbating depression, conducting qualitative research to understand better the subjective experiences of individuals living with depression, and integrating genetic and biological measures into the analysis. By identifying the critical factors associated with depression and poor mental health rates, this study provides valuable insights into the underlying mechanisms of mental health disparities. It offers potential solutions for addressing this pressing public health issue.
- https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh
- https://gisgeography.com/us-county-map/
- https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data/
- https://data.census.gov/
- https://www.anl.gov/dis/county-economic-impact-index
- https://www.census.gov/programs-surveys/community-resilience-estimates/data/datasets.html
- https://github.com/NREL/hsds-examples
GitHub Repo: https://github.com/eitanaka/DATS6101_Final_Project_Team2